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Abstract (Article Summary) 

SpeedTracer, a World Wide Web usage min ing and analysis t ool, was developed to understand user surfing 
behavior by exploring the Web server log files with dafa mining techniques. With innovative algorithms, 
SpeedTracer first identifies user sessions by reconstructing user traversal paths. It does not require "cookies" or 
user registration for session identification. User privacy is protected. Once user sessions are identified, data mining 
algorithms are then applied to discover the most common traversal paths and groups of pages frequently visited 
together. Three types of reports are prepared: user-based rep orts, path-based re ports, and group- based reports. 
The design of SpeedTracer is described and some of its features are demonstrated with a tew sample reports. 

Full Text (7845 words) 

Copyright International Business Machines Corporation 1998 



SpeedTracer, a World Wide Web usage mining and analysis tool, was developed to understand user surfing 
behavior by exploring the Web server log files with data mining techniques. As the popularity of the Web has 
exploded, there is a strong desire to understand user surfing behavior. However, it is difficult to perform user- 
oriented data mining and analysis directly on the server log files because they tend to be ambiguous and 
incomplete. With innovative algorithms, SpeedTracer first identifies user sessions by reconstructing user traversa l 
paths . It does not require "cookies" or user registration for session identification. User privacy is protected. Once 
user sessions are identified, data mining algorithms are then applied to discover the most common traversal paths 
and groups of pages frequently visited together. Important user browsing patterns are manifested through the 
frequent traversal paths and page groups, helping the understanding of user surfing behavior. Three types of reports 
are prepared: user-based reports, path-based reports and group-based reports. In this paper, we describe the 
design of SpeedTracer and demonstrate some of its features with a few sample reports. 

Popularity of the World Wide Web (www) on the Internet has exploded recently. Many organizations have invested a 
tremendous amount of capital to operate sites on the Web. These Web sites provide communications and services 
to their employees, customers, and suppliers. With money invested in these sites, there is a strong desire to 
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understand the effectiveness of such investments and to find ways to realize the potential opportunities provided by 
the Internet. As a result, it has become important to understand user surfing behavior. 

To understand how visitors navigate a Web site, the Web server log files are analyzed. However, it is generally 
difficult to perform user-oriented data mining or analysis directly on the server log files because they tend to be 
ambiguous and incomplete. Typical server log files contain the following information about a request: client host 
Internet Protocol (IP) addres s, ti me stamp, method , URL1 (uniform resource locator) address of the requested 
document, HT[PZ (Hypertext Transfer Protocol) version, return code (status of the request, i.e., success or error 
codes), bytes transferred, referrer page URL, and agent (browser and client operating system). The user identifier is 
usually not available in the log file. Due to the use of proxy servers by Internet Service Providers (ISPs) and firewalls 
by commercial corporate gateways, true client IP addresses are not available to the Web server. Instead of various 
distinct client IPs, the same proxy server or firewall IP will be recorded in the server log files, representing requests 
of different users who come to the Web site through the same proxy server or firewall. This situation creates 
ambiguity in the log records. Furthermore, some Web pages are generally cached by local clients or various proxy 
servers, or both, in order to reduce network traffic. As a result, log records will be missing for the corresponding 
accesses to the cached Web pages, resulting in an incomplete log. A more complete discussion of the difficulties in 
obtaining reliable usage data on the Web can be found in Reference 3. 



For example, Table 1 shows a few sample entries of an access log in the combined log format from a National 
Center for Supercomputing Applications (NCSA)4 HTTPd.5 The first entry in Table 1 represents a GET request from 
a user going through peo-ill21.ix.netcom.com for file /images/nudge.gif following HTTP/1.0 protocol. The user may 
or may not be physically logged in on the machine peo-ill-21.ix.netcom.com. He or she may be just using the 
machine as a gateway to the Internet. The file size of nudge.gif is 37 bytes, and it was successfully transferred. The 
agent used to view page nudge.gif is MSIE** 3.01 (Microsoft Internet Explorer** 3.01) running on Windows NT**. 
Finally, the user was referred to the "gif, file from http: //www.internet.ibm.com/. Namely, either file nudge.gif is on 
the home page of http://www.internet.ibm.com/ or there is a hyperlink to it from the home page. 

To solve the problem of proxy servers or firewalls masking user IPs, it generally requires either user registrations or 
log-ins or the employment of "cookies" between the Web server and client browsers. With log-ins or cookies, a Web 
server can identify distinct requests made by individual users through a token carried between the user's browser 
and the server. 

But the desire by many, if not the majority of, users to have privacy and remain as anonymous as possible may 
force many Web servers not to ask for registration or not to use cookies. As a result, there is a strong need for a tool 
that can analyze user-oriented behavior from the regular server log files without requiring cookies or registrations. 
SpeedTracer,6 a Web usage mining and analysis tool, has been developed for such a purpose. 

Several Web server log analysis tools have been implemented. Some of these tools are very simple and do not 
attempt to identify individual user sessions. These packages are simply mechanisms through which a Web master 
can view the raw Web s erver statistics, s uch as hit co unts and distribution s based on geog raphic re gions. Examples 
— 9 of this type of tool includ e wwwstat (http: //www.ics.uci.edu/pub/websoft/ wwwstat) and Analog 
=y (http://www.statslab.cam.ac. uk/~sret1 /analog). * 

To provid e user-oriente d Web usage analysis, user sessions must first be identified. More sophisticated analysis 
packages identify user sessions with some or all of the following three mechanisms. First, if the Web server provides 
cookies, it is a trivial task to formulate the session. Every access to the Web server with the same cookie value 
makes a single session. Second, if the server does not provide cookies, it may require a log-in ID for each browser. 
The analysis tool can use the log-in IDs to identify sessions. In Reference 7, a data mining tool was developed on 
the assumption that log-in IDs are available. Without log-in IDs, the data mining tool cannot perform its intended 
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functions. In fact, most log records do not contain log-in IDs. Lastly, if the Web server does not provide either 
cookies or user IDs, the analyzer identifies sessions with host addresses. All accesses to the Web server from a 
given host address are considered to be a session until a predefined amount of time has passed between accesses. 
As mentioned previously, the use of a proxy and firewall causes all browsers from a given proxy or firewall to be 
considered a single user. As a result, an identified session may in fact contain many independent user sessions. 
Several Web analyzers only use the host address to identify sessions, such as SurfReport** 
(http://software.bienlogic.com/SurfReport) and NetTracker** (http://www.sane.com/products/ NetTracker). Other 
tools use a combination of methods to identify sessions, such as the Usage Analyst** by Microsoft Corporation 
(previously Interse http://www.interse.com) and WebTrends** (http: //www. webtrends.com). 

In contrast, SpeedTracer uses the referrer page and the URL of the requested page as a traversal step and 
reconstructs the user traversal paths for session identification. No "cookies" or user registration are required. In 
Reference 8, an alternative approach to reconstructing user traversal paths was proposed. Instead of using a 
referrer page, information about the topology (i.e., hyperlink structure) of a Web site (together with other heuristics) 
was used to identify legitimate traversals. A software agent was first used to perform an exhaustive breadth-first 
traversal of pages within the Web site in order to construct the topology. However, the topology is not really needed 
if referrer information is available. 

Once user sessions are identified, statistics related to user behavior can be obtained. Interesting user- based 
statistics include th e top N referre rs to a Web site, the top N page s most frequently visited by users, the top N pages 
from or into which users most frequently exit or enter a Web site, the top N browsers most frequently used, the top N 
HP hosts from which most users come, the demographics (by organization or by country) of users, the distribution of 
user session durations, the distribution of numbers of pages visited during a user session, and the distribution of 
depth or breadth of a user session. 

With user sessions, data mining techniques can be applied to obtain interesting user browsing patterns. Data mining 
has recently been used to discover customer buying patterns by many retailers and other service corporations. One 
of the most important data mining problems concerns mining association rules. 9-12 Given a set of transactions, 
where each transaction is a set of items, an association rule is an expression of the form X -> Y, where X and Y are 
sets of items. An example of an association rule is: "30 percent of transactions that contain bread and butter also 
contain milk; 2 percent of all transactions contain both of these items." Here 30 percent is called the confidence of 
the rule, and 2 percent the support of the rule. The thrust of mining association rules is to find all association rules 
that satisfy user-specified minimum support and minimum confidence constraints. In mining association rules, the 
most important problem is to generate all combinations of items that have the minimal support. These combinations 
of items are called large itemsets. 

In SpeedTracer, we mapped each identified user session into a transaction and then applied data mining 
techniques to discover the top N most frequented user traversal paths and the top N groups of pages most 
frequ ently visited to gether. These problems are to some extent similar to finding the top N large itemsets for 
traversal paths and groups of pages. But specific differences exist. A traversal path is a collection of consecutive 
URL pages in a Web presentation, where one URL is referred to by the immediately preceding URL. The URLs in a 
traversal path are connected in the Web presentation graph. In contrast, the pages in a group are not necessarily 
connected among themselves. A frequently visited group of pages may contain two or more disjoint traversal paths. 
By examining the traversal paths and groups of pages, valuable user browsing patterns can be obtained to improve 
the organization and linkage of the Web presentation. 

Note that finding frequent traversal paths is also to some extent similar to the problem of mining sequential 
patterns. 13 However, the results from the sequential-pattern-mining method in Reference 13 may contain 
sequences that do not represent a traversal path in the Web presentation graph. The reason is there may be many 
backward traversal steps involved in a user session, and pages on two different paths may be recognized as part of 
a sequential pattern. In this paper we focus on mining the frequent traversal paths and groups of pages visited 
together. However, other types of data mining techniques, such as clustering pages, clustering users, or 
classification, could be applied to the derived user sessions. These and other techniques will be explored in future 
work. 

The next section describes the design and implementation of SpeedTracer. The implementation details of session 
identification, mining of frequent traversal paths, and mining of groups of pages most frequently visited together are 
presented. The section following the next one shows a few sample reports from SpeedTracer. User reports, path 
reports, and group reports are highlighted. Finally, we conclude with a summary. 
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Figure 1 



Design of SpeedTracer 

In SpeedTracer, we first process the access, referrer, and agent logs (or just access log, if it is in combined log 
format) to identify user sessions. If referrer and agent information are stored in separate files, delicate 
synchronization procedures may be needed to relate an access log record to the corresponding referrer and agent 
records because log entries may be missing in some of the files. For example, referrer or agent log records, or both, 
may be missing from NCSA logs. However, in IBM's ICS (Internet Connection Server) logs, every access log record 
has corresponding referrer and agent log records even if they are stored in separate files. SpeedTracer is designed 
to process different kinds of log formats, such as the NCSA log, the IBM ICS log, and others. For simplicity, in this 
paper we use the IBM ICS log as an example. Figure 1 shows the flow diagram of SpeedTracer implementation. 
After log records are processed, user sessions are identified using advanced inference algorithms. Interesting user- 
based statistics are then computed based on these sessions. Next we apply data mining techniques to discover the 
top N most frequented user traversal paths. Finally, we discover the top N groups of pages most frequently visited 
together. 

User session identification. Given the information available in the server log, a possible approach to grouping 
"various accesses into user sessions is to use bot h time stam ps and agent information. For example, a user session 
could include all accesses within a predetermined interval if their agents are the same. Unfortunately, this approach 
cannot distinguish two different clients with the same agent coming from the same proxy within the specified time 
interval. For instance, in Table 1 the two accesses from ss5-08. inre.asu.edu both use Netscape Navigator** 3.0 
(Mozilla/3.0) and come within four seconds. The two accesses can be viewed as from the same user session. But 
they may be from two different user sessions going through the same proxy server. As the markets for both 
browsers and desktop operating systems become ever more consolidated, it is highly likely that multiple accesses 
from different users will have the same agent. For instance, most home users may use the same version of 
Netscape Navigator browser running on a Windows 95** desktop. Thus, time stamps together with agent information 
are not sufficient to identify user sessions from the server log files. 

In SpeedTracer we use five key pieces of information from a log record to identify user sessions . They are IP, 
l imestamp, URL (the requestea page), Keterrai, ana Agent. Different ims or agents obviously indicate different user 
sessions. If the time stamps indicate that two accesses are separated by more than a prespecified period of time, 
the accesses are also considered to belong to different sessions. In addition to these obvious rules, SpeedTracer 
uses the referral page to help more accurately identify user sessions. For each log record, we use the referral page 
and the requested page URL to form a hyperlink access pair, representing a step in a user traversal path. Each 
access pair is then used to reconstruct a user traversal path in the Web presentation. The basic idea is that access 
pairs constitute a connected traversal path during a user session. Note that the traversal path can be forward or 
backward. Session identification becomes the partitioning of log records into groups so that the access pairs within a 
group form a connected traversal path. However, because browsers and proxy servers generally use caching to 
reduce network traffic and improve performance, there are no corresponding log entries for those accesses to the 
cached pages. As a result, missing access pairs might be in the log files, and these missing access pairs need to be 
added back during session identification. 

In session identification, we process the log records one at a time. Each access pair is added to an active session, if 
possible. If (x A sub i A -> y A sub i A ) represents an access pair, then the traversal path of a session S of size n can be 
expressed as follows: 

S: (x A sub 1 A -> y A sub 1 A ), (x A sub 2 A -> y A sub 2 A ), . . . , (x A sub n A -> y A sub n A ) 
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where x A sub i + 1 A , = y A sub i A , 1 <= i < n. A new access pair (x A sub j A -> y A sub j A can be appended to an active 
session S, if x A sub j A = y A sub k A , 1 <= k <= n, or x A sub 1 A = x A sub j A . However, unless x A sub j A = y A sub n A , a 

backward access path (y A sub n A -> x A sub n A ) (y A sub k = 1 A -> x A sub k + 1 A ) must first be added back to S to 

maintain a connected traversal path. For example, if (b -> d) were to be appended to a session S A sub i A : (a -> b), (b 
-> c), a backward traversal pair (c -> b) has to be first appended to S A sub i A . Thus, the new S A sub i A becomes (a -> 

b) , (b -> c), (c -> b), (b -> d). 

Apparently, there can be multiple candidate sessions to which a new access pair can be appended. Different criteria 
or combinations of them can be used to choose one candidate session. For example, one criterion can be the 
number of backward access pairs needed to be added. Another criterion can be the time-stamp difference between 
the access pair and a session. The time stamp of a session is the time stamp of its latest appended access pair. 
Combinations of these two criteria can also be used. For example, one can choose the session with the smallest 
time-stamp difference with backward access pairs no more than m, or the one with the smallest number of backward 
access pairs with time-stamp difference no more than q minutes. Advanced inference algorithms are developed for 
this purpose. 

As an example, Table 2 shows the key information of eight example log records used by SpeedTracer for session 
identification. These eight log records represent requests coming from a gateway for the IBM Watson Research 
Center. From Table 2, the hyperlink access pairs for the eight log records are (- -> a), (e -> b), (b -> c), (- -> b), (b -> 

c) , (- -> f ), (a -> b), (b -> g), respectively. Here, "-" means that no referral page is available for this access. Using 
these access pairs and the agent information, we can identify four user sessions as follows: S A sub 1 A : (- -> a), (a -> 

b) , (b-> g) from log records 1 1 7, and 8; S A sub 2 A : (e -> b), (b -> c) from log records 2 and 3; S A sub 3 A : (- -> b), (b -> 

c) from log records 4 and 5; and S A sub4 A : (- -> f) from log record 6. Note that if we were to use only time stamp and 
agent for session identification, we would have grouped log records 1, 2, 3, 7, and 8 as a user session and log 
records 4, 5, and 6 as another user session. However, from the referral information of both log records 1 and 2, it is 
obvious that these two are from different user sessions. The access to page b in log record 2 must be following a 
previous access to page e, whereas the access to page a may be the beginning of a new user session since it did 
not contain a referral. 

The need for inferring backward access pairs in session identification due to proxy or client caching can be 
illustrated with the following example. Figure 2 shows a traversal pattern by a Web user during a surfing session. 
Note that any access pair (x -> y) can be the result of clicking a hyperlink to page y on page x or by clicking on the 
backward button on the Web browser to page y after the viewer has looked at page x. The user session represented 
by the connected traversal path in Figure 2 can be described as follows: (- -> a), (a -> b), (b -> c), (c -> d), (d -> c), 
(c -> b), (b -> e), (e -> b), (b -> a), and (a -> f ). However, since browsers usually cache recently visited pages, some 
of the actual traversal steps may be missing from the server log files. For example, in Figure 2, (d -> c), (c -> b), (e - 
> b), and (b -> a) may be missing. These missing traversal steps may need to be inferred in order to identify 
traversal paths and user sessions. 
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Figure 2 
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Table 2 



Since a "gif ' or "jpg" file typically does not expand a traversal path, we eliminate all log records whose URL contains 
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these graphical files in our session identification. SpeedTracer also takes care of special cases caused by a user's 
clicking on the "reload" button and "bookmarking" his or her hot links. On a reload, the repeated access pair is 
discarded since it does not expand a traversal path. If an access is the result of a user accessing the Web page 
through his or her bookmark or directly typing in the URL, no referrer information is available on the log record. In 
SpeedTracer, we view this as the beginning of a new session. Once sessions are identified, user-oriented statistics 
can be obtained. 

Interesting user-based statistics are provided bv SpeedT racer. including the most frequent N external referrers to a 
site, the most frequent N visited pages by users, the most frequent N pages that users most often come into and exit 
from a site, the top N hosts from which most users come to visit a site, the distribution of user session durations, and 
the number of pages visited in a session. Sample reports and their applications will be presented in the next section. 
N can be specified by a user of SpeedTracer before it analyzes the log files and prepares the reports. 

Mining frequent traversal paths . Once user sessions are identified, the problem of mining frequent traversal paths 
becomes a matter of discovering the most frequent subpaths common among all the sessions. In finding user 
traversal patterns, we are only interested in forward traversal subpaths. As a result, SpeedTracer first finds all 
maximum forward paths in each user session, and then discovers all common subpaths among all the maximum 
forward paths of user sessions. A maximum forward path is a sequence of maximum connected pages in a Web 
presentation where no page is previously visited. 14 For example, in Figure 2, three maximum forward paths are in 
this session: (1) (a -> b) (b -> c) (c -> d); (2) (a -> b) (b -> e); and (3) (a -> f). Figure 3 shows the algorithm for finding 
all maximum forward paths from a user session. Assume {x A sub 1, . . . , x A sub m A } represents a user session and 
{y A sub 1 .... , y A sub j-1 A } represents a string holding a potential maximum forward path. The idea is to examine 
each page xi in the session one at a time and try to expand the potential maximum forward path by copying x A sub i A 
to y A sub j A , if x A sub i A is not equal to any y A sub k A for every 1 <= k < j. Namely, the pages in the potential maximum 
forward path are all distinct pages, and we are going in the forward direction in the user traversal path. We use a 
flag to indicate that we are currently moving in the forward direction in path traversal. In contrast, if x A sub i A is equal 
to some y A sub k A , 1 <= k < j, then we are going backward in the traversal path, and subpath {y A sub 1 , . . . , y A sub j- 
1} can be a maximum forward path if the flag indicates that we have been going in the forward direction before this 

step. After discovering matched page y A sub k A , we eliminate pages {y A sub k = 1 A y A sub j-1 A } from the potential 

maximum forward path by moving j backward to k + 1 for the next iteration, and set the flag to indicate backward 
direction. At the end, if the flag indicates forward direction, the final subpath is the final maximum forward path for 
the session. 

As an example, Table 3 shows the values of subpath {y A sub 1 A , . . . , y A sub j-1 A } and the flag at the end of each 
execution step of finding maximum forward traversal paths. We use the user session in Figure 2 as our input. If we 
represent the session as a sequence of pages visited, this session is {a, b, c, d, c, b, e, b, a, f}, and the three 
maximum forward paths identified are {a, b, c, d}, {a, b, e}, and {a, f}. For the first four steps, we are going in the 
forward direction and expanding the potential maximum forward path. In step 5, page c is found in subpath {a, b, c, 
d}, so a maximum forward path is found, and the traversing direction is reversed. Such a reversal lasts for two steps 
until step 7 when page e is expanded again to form {a, b, e}. In step 8, page b forces {a, b, e} out as another 
maximum forward path and reverses the traversal direction. At the end, {a, f} is found as a maximum forward path 
since the flag indicates forward direction. 

Once the maximum forward paths are constructed for each session, we then map the problem of finding the top N 
frequent traversal paths into the one of finding frequently occurring consecutive subsequences among the maximal 
forward paths of all user sessions. A large traversal path is a sequence of consecutive pages that appeared in the 
maximal forward paths of a sufficient number of sessions. The number of sessions in which a large traversal path 
appears is called its support. A large traversal path of size k contains k pages. In this paper, we denote the set of top 
M large traversal paths of size k as LP A sub k A . 

Note that a significant difference exists between discovering large itemsets in mining association rules and 
discovering large traversal paths in mining traversal patterns. In a large traversal path, the pages must form a 
consecutive sequence in a maximal forward path, whereas a large itemset in mining association rules is just a set of 
items in a transaction. 

Assume that P A sub k,M A is the Mth largest traversal path in LP A sub k A (the support of P A sub k,M A is the Mth 
largest); s A sub k A is the support of Pk.M; Fi is the set of maximum forward paths for session S A sub i A ; and {x A sub 
1 A , x A sub 2 A , . . . , x A sub m} is the sequence of pages representing a maximum forward path in F A sub i A . The 
algorithm for constructing LP A sub k A , k > 1 can be described as in Figure 4. After each LP A sub k A is constructed, the 
top N traversal paths are then reported. In SpeedTracer, we set M to be greater than N in computing each LP A sub 
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k\ Namely, we computed more large traversal paths for each LP A sub k A than we reported. LP A sub 1 A , is the set of 
all single pages, and s A sub 1 A is the number of user sessions that referenced the Mth hottest page. 
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Table 3 



The idea of constructing LP A sub k A is to find a candidate path of size k, {x A sub j A , . . . , x A sub j=k-1}, from a 
maximum forward path and then compute its occurrence count among the maximum forward paths of all user 
sessions. The condition is that the two consecutive subsequences of size k - 1 , {x A sub j A , . . . , x A sub j+k-2} and 

{x A sub j=1 x A sub j+k-1} are among the top M largest traversal paths in LP A sub k-1\ For example, assume 

session S A sub 1 A , contains two maximum forward paths: {A, B, C, D, E} and {G, H}. To consider candidate large 
paths of size 3 for inclusion in LP A sub 3 A , we would test three candidate large paths: {A, B, C}, {B, C, D}, and {C, D, 
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E}. If both {A, B} and {B, C} are among the top M large paths in LP A sub 2 A , then {A, B, C} is a candidate for LP A sub 
3 A . Similarly, if both {B, C} and {C, D} are in LP A sub 2 A , then {B, C, D} can be included in LP A sub 3 A . 

In Figure 4, for each candidate large path of size k, {x A sub j, . . . , x A sub j+k-1 A } from the maximum forward paths of 
a user session, we increment its occurrence count if it already is in LP A sub k A . There are (m - k + 1) total 
consecutive subsequences of size k from {x A sub 1 , x A sub 2 A , . . . , x A sub m A } . For example, there are three such 
candidate consecutive subsequences of size 3 ({A, B, C}, {B, C, D}, and {C, D, E}) from a maximum forward path of 
size 5 ({A, B, C, D, E}) as we just illustrated above. We examine each one of them. If the subsequence of size k has 
not already been included in LP A sub k A , we test to see if it can be. If yes, we add {x A sub j A , . . . , x A j+k-1 A }; otherwise 
we do nothing. The conditions here are based on the fact that if a traversal path of size k is among the top M largest 
in LP A sub k A , then its two consecutive subsequences of size k - 1 must be also among the top M largest in LP A sub 
k-1 A . Obviously, if m < k, nothing needs to be done for this maximum forward path. For instance, nothing needs to 
be done for {G, H} for LP A sub 3\ Note that for each k, all the maximum forward paths of all user sessions are 
scanned only once. 

Mining groups of pages most frequently visite d. Frequent traversal paths identify pages that are on the same 
forward path in a Web presentation. These pages represent consecutive subsequences in the maximum forward 
paths of user sessions. However, there might be groups of pages not on the same traversal path but frequently 
visited together by users. By examining both frequented traversal paths and frequently visited groups, valuable 
information can be obtained to improve the organization and linkage of the Web presentation. For example, in 
Figure 2, pages a, b, e, and f may be visited most frequently by users, but these four pages are not on the same 
path in the Web presentation. Thus, it may be better to provide an HTTP link from page e to page f so that most 
users would not have to traverse backwards from page e to b, then to page a before they can go to page f. 

To mine th e frequently visited groups from user sessions , we need the distinct pages in each session. Thus, any 
fluplicauori Of pages caused by backward traversals was first eliminated in each session. Unlike traversal paths 
where the ordering of pages in a sequence is important, there is no ordering in a group of pages. Similar to mining 
the traversal paths described above, let us assume that LG A sub k A is the set of top M largest groups, each 
consisting of k pages, and the support of the smallest group in LG A sub k A is Sk. Unlike mining traversal paths, 
however, L G A sub k+1 A , cannot be efficiently constructed directly because of the numerous possible combinations 
of candidate groups of size k + 1 from each session. For example, a maximum forward path {x A sub 1 A , . . . , x A sub 
m A } has (m - k + 1) candidate traversal paths of size k. But a session {x A sub 1\ . . . , x A sub m A } will have C A sup 
m AA sub k A (or m!/k!(m - k)!) candidate groups of size k. As a result, SpeedTracer constructed LG A sub k+1 A by (1) 
generating a set of candidate groups of size k + 1, denoted as CG A sub k+1 A , from LG A sub k A , and (2) counting the 
occurrences of each group in CG A sub k+1 A against all sessions. The approach to mining frequently visited groups of 
pages is thus similar to the discovery of large itemsets in most prior literature of mining association rules.9,10,12 
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Figure 7 



Similar to LP A sub 1 A , LG A sub 1 A contains the top M hottest pages referenced by user sessions. As pointed out in 
Reference 12, because of the nature of C A sup M AA sub 2 A , the computations of CG A sub 2 A and LG A sub 2 A can be 
substantially more demanding than other CG A sub k A and LG A sub k A fork > 2, This is in contrast to the case of 
mining traversal paths, where the candidate paths of LP A sub 2 A cannot be more than the number of links in the Web 
presentation. Therefore, special treatment is needed. 12 

The task of generating CG A sub k A (candidate groups set) from LG A sub k-1 A can be described as in Figure 5. Note 
that we generate CG A sub k A from LG A sub k-1 A . To simplifyenumerating all possible combinations of groups, we sort 
the M groups in LG A sub k-1 A , based on their lexicographical order. The basic idea here is to find all possible groups 
of size k from LG A sub k-1 A , based on the fact that for a group of size k to be a candidate group in CG A sub k A , all its 
subgroups of size k -1 must be in L G A sub k-1 A . 9,10 So, we first try to construct a group of size k for each {x A sub 

1 A , . . . , x A sub k-1 A } in LG A sub k-1 A by finding all the {y A sub 1 y A sub k-1 A } in LG A sub k-1 A such that x A sub 2 A = 

y A sub 1 A , . . . , x A sub k-1 A = y A sub k-2 A . The new k-sized group is thus an expansion of {x A sub 1 A , . . . , x A sub k-1 A } 
with {y A sub k-1 A }. In order for such new k-sized group to be included into CG A sub k A , all the combinations of (k - 1)- 
sized subgroups of it must all be in LG A sub k-1 A . 




Figure 8 
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Figure 9 



Once CG A sub k A is generated, we scan all the user sessions one at a time and increment the occurrence count of 
each candidate group in CG A sub k A if a session contains all the pages in the group. Upon completion, the top M 
candidate groups in CG A sub k A become LG A sub k A . 




Now that we have described the implementation of SpeedTracer, we present some sample reports here. Three 
types of reports are generated by SpeedTracer: user reports, path reports, and group reports. These reports are 
generated in H TML format so that one can view the reports throug h a browser from the Internet or an intranet (see 
Figure 6). Hot links are also provided so that one can click on them th rough a browser and go to the original pages 
t o see what thev are. Java** applets are used to show van ous'charts in th6 user reports. ■ 

To demonstrate some of the features of SpeedTracer, we processed the server log files generated at one of the 
IBM Web sites running IBM ICS. There were 37 984 entries in the access, referrer, and agent log. Note that the 
three IBM ICS logs contain exactly the same number of entries. However, there might be a big discrepancy among 
the three log files from other Web servers, such as the NCSA HTTPd. Sometimes log entries are simply dropped by 
the Web server in one of the files. Special logic in SpeedTracer is designed to synchronize the three files. 

Figure 7 shows a snapshot of two statistics from the user reports. One is the top 10 most frequent external referrers 
to the server site, and the other is the top 10 most frequently visited pages. Note that "external" is with respect to the 
server site whose log files are being analyzed. The largest user count in the external referrer table is "no referrer." 
When a user visits a URL from his or her bookmark or by directly typing in the URL, there is no referrer information 
for such an access. But, the large count is due mostly to the fact that many of the accesses involve CGI (Common 
Gateway Interface) programs. As a result, these accesses were treated as single- page sessions. The most 
frequent external referrer table can be used to measure the effectiveness of Web advertisements. It indicates the top 
external URLs from which most user sessions begin. If one has paid to place an advertisement on a certain site but 
the user count for this external referral is consistently low, one immediately realizes that the money was not well 
spent. 



In the table for the most frequently visited pages in Figure 7, both click counts and user counts are provided. 
Differences between click and user counts do exist, and some of them can be substantial. As expected, the most 
frequently visited page by user count is the home page (/). However, on other experiments, we found that the home 
page is not always the hottest page visited by users because some users may have bookmarked some page other 
than the home page and go directly to it. In fact, this can be verified by the statistics that show the top pages to 
which users most often enter a Web site as provided by SpeedTracer (not shown here). 
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Figure 8 shows other interesting statistics from t he user reports: the top 20 IP/host names from which most users 
come to visit the Web site, the total number of user sessions from each IP/host, the average duration, and the 
number of pages visited for these user sessions. There are a lot of single-page sessions because of the lack of 
referrer information. The overall distributions of user session duration and number of pages visited by users are also 
provided in Figure 9. The majority of the user sessions last less than 10 minutes and are visited by less than 10 
pages. Java applets were used to draw charts on the user's browser based automatically on the data in the report. 

Figure 10 shows sample statistics fro m the path reports. Sp eedTracer presents the top N most frequently visited 
paths with different numbers of links. In Figure 10, we only present the top 10 most frequently visited paths with two 
links. These paths are forward paths, meaning that one can follow the path to visit each page. Figure 11 , in contrast, 
shows the top 10 most frequently visited groups consisting of three pages. These pages may or may not be on the 
same path. Even if they are, they may be visited at various times via other intermediate pages. For example, the top 
path in Figure 10 and the top group in Figure 1 1 contain the same three pages. But their user counts are different. 
The user count for the path is less than that for the group of pages because there are many different ways to visit 
these three pages. 

By comparing Figures 10 and 1 1 , we notice that only the first group and the lOth group in Figure 1 1 are also among 
the top 10 frequented paths in Figure 10. However, seven of these eight remaining groups contain pages on the top 
10 paths with five links (see Figure 12) except (/, /ics/icfgive.htm, and /cgibin/htimage/gifs/anim18.map). Such 
findings suggest that many users may have traveled very deep through various paths before they found the 
commonly desired pages. A simplified design to shorten the depth of the traversal paths might be warranted. Since 
HT=P links are typically embedded in a very complex way, an examination of both frequently visited paths and 
groups can help a Web site to better organize its presentation. 

Summary 
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In this paper, we described the design of SpeedTracer, a Web usage mining and analysis tool. It reconstructs user 
traversal paths to identity user sessions even if user identities are hidden behind proxy servers or firewalls. No 
"cookies" or user registration are required for user session identification. User- oriented statistics are provided, such 
as the most frequent external referrers, the most frequently visited pages, the distributions of user session durations 
and number of pages visited. In addition, the most frequented traversal paths and the most frequently visited group 
of pages are also reported by SpeedTracer. We also presented a few snapshots of sample reports generated with 
SpeedTracer. 
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SpeedTracer, a World Wide Web usage mining and analysis tool, was developed to understand user surfing behavior by exploring 
the Web server log files with data mining techniques. As the popularity of the Web has exploded, there is a strong desire to 
understand user surfing behavior. However, it is difficult to perform user- oriented data mining and analysis directly on the server log 
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protected. Once user sessions are identified, data mining algorithms are then applied to discover the most common traversal paths 
and groups of pages frequently visited together. Important user browsing patterns are manifested through the frequent traversal 
paths and page groups, helping the understanding of user surfing behavior. Three types of reports are prepared: user-based reports, 
path-based reports and group-based reports. In this paper, we describe the design of SpeedTracer and demonstrate some of its 
features with a few sample reports. 
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